We calculate the multifractal spectrum of the partition of the coupling space of a perceptron
induced by random input-output pairs with non-zero mean. From the results we infer the influence
of the input and output bias respectively on both the storage and generalization properties of the
network. It turns out that the value of the input bias is irrelevant as long as it is different from zero.
The generalization problem with output bias is new and shows an interesting two-level scenario.
To compare our analytical results with simulations we introduce a simple and efficient algorithm to
implement Gibbs learning.

PACS numbers: 87.10.+e, 02.50.Cw
I. INTRODUCTION

The properties of simple networks of formal neurons can be quantitatively described by characterizing the partition
of the coupling space induced by the required input-output mappings. In some cases the geometrical properties
of this partition can be concisely specified by the multifractal spectrum of the distribution of cells in coupling space
corresponding to the different output sequences that can be generated by the system for given inputs [5,6]. This
approach has been used for a number of investigations of both single-layer and multi-layer feed-forward neural
nets [7]-[10] and revealed a number of interesting new aspects. The simplest case of the perceptron allows a rather
detailed analysis that also highlights the problems and subtleties of the method [11]. All investigations done so far have,
however, considered symmetric statistics of both inputs and outputs.

In the present paper we analyze the multifractal properties of the phase space of a single-layer perceptron induced 
by input patterns with biased statistics. A possible bias of the outputs is taken care of by considering a special subset 
of cells only. The investigation is motivated by the fact that the storage properties of a perceptron are known to depend
markedly on the statistics of the patterns. A similar influence can be expected for the generalization behaviour, which
has, to our knowledge, not been studied for biased outputs before.

The analysis is performed using the methods that have already been employed for the case of unbiased patterns.
With the help of the replica trick the multifractal spectrum $f(\alpha)$ averaged over the distribution of inputs is calculated
analytically for different pattern set sizes $\gamma$. The storage and generalization properties are determined by the points
with slope 0 and 1 respectively of these curves. The results for the storage capacity are compared with previous
findings whereas those for the generalization behaviour are checked against numerical simulations. In order to
efficiently explore the version space in these simulations, we introduce a randomized variant of the perceptron learning
algorithm.

II. CALCULATION OF THE CELL SPECTRUM

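We consider a spherical perceptron specified by a real coupling vector $\mathbf{J}$ obeying $\mathbf{J}^2 = N$ and a set of $p = \gamma N$
input patterns $\boldsymbol{\xi}^\mu$, $\mu = 1, \ldots, p$, with components $\xi_i^\mu$ drawn independently of each other from the distribution (1),
which has mean $\langle\langle \xi_i^\mu \rangle\rangle = m$.

[Equation (1), the explicit form of the pattern distribution, is not recoverable from the source.]

To every input the perceptron determines an output according to

$$ \sigma^\mu = \mathrm{sgn}\Big(\frac{1}{\sqrt{N}}\,\mathbf{J}\cdot\boldsymbol{\xi}^\mu\Big) = \mathrm{sgn}\Big(\frac{1}{\sqrt{N}}\sum_{i=1}^{N} J_i \xi_i^\mu\Big) . \tag{2} $$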
Every set of input patterns therefore divides the coupling space into $2^p$ cells

$$ C(\sigma) = \{\mathbf{J} \mid \sigma^\mu = \mathrm{sgn}(\mathbf{J}\cdot\boldsymbol{\xi}^\mu)\ \ \forall \mu\} \tag{3} $$

labelled by the sequence of outputs $\sigma^\mu$. Note that some of the cells may be empty. The size $P(\sigma) = V(\sigma)/\sum_{\tau} V(\tau)$
of the cell gives the fraction of the coupling space that will produce the outputs $\sigma^\mu$ given the inputs $\boldsymbol{\xi}^\mu$.
It is convenient to characterise the cell sizes by a crowding index $\alpha(\sigma)$ defined by

$$ P(\sigma) = 2^{-N\alpha(\sigma)} . \tag{4} $$
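For very small systems the cell sizes entering (4) can be estimated by direct sampling. The following is a minimal sketch, not part of the original analysis: it assumes binary patterns $\xi_i^\mu = \pm 1$ with mean $m$ (the explicit pattern distribution is an assumption here), and the helper names `sample_cell_sizes` and `crowding_index` are hypothetical.

```python
import math
import random
from collections import Counter


def sample_cell_sizes(patterns, n_samples, rng):
    """Estimate P(sigma) by drawing coupling vectors uniformly on the sphere."""
    n = len(patterns[0])
    counts = Counter()
    for _ in range(n_samples):
        # The direction of an isotropic Gaussian vector is uniform on the
        # sphere; its length does not change the signs of the local fields.
        j = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sigma = tuple(
            1 if sum(a * b for a, b in zip(j, xi)) > 0 else -1
            for xi in patterns
        )
        counts[sigma] += 1
    return {s: c / n_samples for s, c in counts.items()}


def crowding_index(p_sigma, n):
    """alpha(sigma) defined through P(sigma) = 2^(-N alpha(sigma))."""
    return -math.log2(p_sigma) / n


rng = random.Random(0)
N, P, M_IN = 5, 4, 0.5  # illustrative sizes, far from the thermodynamic limit
# Assumed pattern statistics: +1 with probability (1 + m)/2, else -1.
patterns = [
    [1 if rng.random() < (1 + M_IN) / 2 else -1 for _ in range(N)]
    for _ in range(P)
]
cells = sample_cell_sizes(patterns, 20000, rng)
alphas = {s: crowding_index(p, N) for s, p in cells.items()}
```

Only cells that were hit at least once appear in `cells`, so very small cells are missed at this sample size; the sketch illustrates the definitions rather than the large-$N$ behaviour.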

The entropy of the distribution of cell sizes in the thermodynamic limit averaged over the input patterns is then given
by

$$ f(\alpha) = \lim_{N\to\infty} \frac{1}{N} \Big\langle\!\Big\langle \log_2 \sum_{\sigma} \delta(\alpha - \alpha(\sigma)) \Big\rangle\!\Big\rangle \tag{5} $$

where $\langle\langle \ldots \rangle\rangle$ denotes the average over the input pattern distribution (1). Involving a trace over all output sequences
$\sigma$, this quantity cannot be used to elucidate the dependencies on the output bias. In fact an explicit calculation shows
that it is also independent of the input pattern distribution, giving results for $f(\alpha)$ identical to those for $m = 0$. The
intuitive reason for this fact is that a cell chosen at random from the above ensemble will with probability 1 lie in
the $N - 1$ dimensional subspace of the coupling space which is orthogonal to the direction of the bias. However the
projection of the cell structure onto this subspace - whose properties dominate the cell spectrum in the thermodynamic
limit - carries no bias.

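In order to study the influence of the input and output statistics on the geometry of the phase space we have
therefore restricted the $\sigma$-trace to outputs with magnetization $m'$ according to

$$ f(\alpha) = \lim_{N\to\infty} \frac{1}{N} \Big\langle\!\Big\langle \log_2 {\sum_{\sigma}}' \delta(\alpha - \alpha(\sigma)) \Big\rangle\!\Big\rangle = \lim_{N\to\infty} \frac{1}{N} \Big\langle\!\Big\langle \log_2 \sum_{\sigma} \delta\Big(\sum_\mu \sigma^\mu - \gamma N m'\Big)\, \delta(\alpha - \alpha(\sigma)) \Big\rangle\!\Big\rangle . \tag{6} $$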

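In the literature of multifractals $f(\alpha)$ is called the multifractal spectrum. It can be calculated by using its analogy
with the microcanonical entropy of a spin system $\sigma$ with Hamiltonian $\alpha(\sigma)$ and the free energy

$$ \tau(q) = -\lim_{N\to\infty} \frac{1}{N} \langle\langle \log_2 Z \rangle\rangle = -\lim_{N\to\infty} \frac{1}{N} \Big\langle\!\Big\langle \log_2 {\sum_{\sigma}}' P^q(\sigma) \Big\rangle\!\Big\rangle = -\lim_{N\to\infty} \frac{1}{N} \Big\langle\!\Big\langle \log_2 {\sum_{\sigma}}' 2^{-qN\alpha(\sigma)} \Big\rangle\!\Big\rangle . \tag{7} $$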

The quenched average over input patterns with magnetisation $m$ is performed using the pattern statistics (1). The
entropy $f(\alpha)$ can now be obtained by a Legendre transformation with respect to the inverse temperature $q$:

$$ f(\alpha) = \min_q\, [\alpha q - \tau(q)] . \tag{8} $$

For the perceptron we have

$$ P(\sigma) = \int d\mu(\mathbf{J}) \prod_\mu \theta\Big(\frac{1}{\sqrt{N}}\,\sigma^\mu\, \mathbf{J}\cdot\boldsymbol{\xi}^\mu\Big) \tag{9} $$

with the integral measure

$$ d\mu(\mathbf{J}) = \prod_i \frac{dJ_i}{\sqrt{2\pi e}}\; \delta(N - \mathbf{J}^2) \tag{10} $$

ensuring the spherical constraint for the coupling vectors and the total normalisation over all cells $\sum_\sigma P(\sigma) = 1$.
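For a finite list of cell sizes the free energy (7) and the Legendre transformation (8) can be evaluated directly. The sketch below is illustrative only: it drops the pattern average and the limit $N \to \infty$, takes the minimum over a finite grid of $q$, and uses the hypothetical helper names `tau` and `spectrum`. It is checked on a toy ensemble of $2^K$ equally sized cells, for which $f(\alpha = K/N) = K/N$.

```python
import math


def tau(q, cell_sizes, n):
    """Finite-N analogue of (7): tau(q) = -(1/N) log2 sum_sigma P(sigma)^q."""
    return -math.log2(sum(p ** q for p in cell_sizes)) / n


def spectrum(alpha, cell_sizes, n, q_grid):
    """Finite-N analogue of (8): f(alpha) = min_q [alpha q - tau(q)] on a grid."""
    return min(alpha * q - tau(q, cell_sizes, n) for q in q_grid)


# Toy ensemble (an assumption for illustration): 2^K cells of equal size 2^-K.
N, K = 10, 6
uniform_cells = [2.0 ** -K] * (2 ** K)
q_grid = [k / 10 for k in range(-20, 31)]

alpha_star = K / N  # all cells share this crowding index
f_star = spectrum(alpha_star, uniform_cells, N, q_grid)
```

For normalised cell sizes `tau(1.0, ...)` vanishes, which is a convenient sanity check before applying the transformation to sampled data.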

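Here $\theta(x)$ is the Heaviside step function. In order to average $\log_2 {\sum_\sigma}' P^q(\sigma)$ over the input patterns we introduce two sets of
replicas: one set labelled $a = 1, \ldots, n$ for the standard replica trick to replace the logarithm and one set labelled $\alpha = 1, \ldots, q$
representing the $q$-th power of $P$ in (7). Thus we arrive at the replicated partition function

$$ Z_n = \langle\langle Z^n \rangle\rangle = \Big\langle\!\Big\langle \sum_{\{\sigma_a^\mu\}} \prod_a \delta\Big(\sum_\mu \sigma_a^\mu - \gamma N m'\Big) \int \prod_{a,\alpha} d\mu(\mathbf{J}^{a\alpha}) \prod_{\mu,a,\alpha} \theta\Big(\frac{1}{\sqrt{N}}\,\sigma_a^\mu\, \mathbf{J}^{a\alpha}\cdot\boldsymbol{\xi}^\mu\Big) \Big\rangle\!\Big\rangle . \tag{11} $$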
Averaging over the quenched disorder results in the order parameters

$$ Q^{a\alpha}_{b\beta} = \frac{1}{N} \sum_i J_i^{a\alpha} J_i^{b\beta} , \qquad M^{a\alpha} = \frac{1}{\sqrt{N}} \sum_i J_i^{a\alpha} $$

as well as their conjugates $\hat{Q}^{a\alpha}_{b\beta}$ and $\hat{M}^{a\alpha}$. In the following $M^{a\alpha}$ will be referred to as the weak magnetisation since
the mean value of the couplings - the strong magnetisation $\frac{1}{N}\sum_i J_i^{a\alpha}$ - vanishes in the thermodynamic limit. The
appearance of an additional order parameter describing a ferromagnetic bias of order $\sqrt{N}$ was to be expected. To
produce biased outputs the local fields $\mathbf{J}\cdot\boldsymbol{\xi}^\mu/\sqrt{N}$ must have an average of order 1. Given $\langle\langle \xi_i^\mu \rangle\rangle = m$ this requires
the sum $\sum_i J_i$ to be of order $\sqrt{N}$, i.e. an average of the $J_i$ of order $1/\sqrt{N}$. The integral representation of the
delta-function restricting the set of outputs introduces a further set of order parameters $R_a$.
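In the present paper we will only discuss the results obtained within the replica symmetric (RS) Ansatz

$$ Q^{a\alpha}_{b\beta} = \begin{cases} 1 & (a,\alpha) = (b,\beta) \\ Q_1 & a = b,\ \alpha \neq \beta \\ Q_0 & a \neq b \end{cases} , \qquad M^{a\alpha} = M , \qquad R_a = R . \tag{12} $$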

This Ansatz represents the two-replica structure of the problem: $Q_1$ is assigned to the overlap between coupling
vectors belonging to identical cells labelled by $\sigma^a$, $Q_0$ is assigned to the overlap between different cells. For the cell
spectrum without bias, it gives the correct result for $0 < q < 1$, which is the interval of interest here. Nevertheless it
remains plagued by divergences for $q < 0$ and continuous replica symmetry breaking for $q > 1$; for details see [11].

Eliminating the conjugate order parameters one finds that $Q_0 = 0$ is always an extremum of $\tau(q)$ since for $Q_0 = 0$ the
saddle point equation in $Q_0$ coincides with that in $M$. The interpretation of this result is that two randomly chosen
coupling vectors with the same weak magnetisation do not have an overlap of order 1.

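We thus arrive at the free energy

$$ \tau(q) = -\frac{1}{\ln 2}\, \mathrm{extr}_{Q_1, M, R} \Big[ \frac{q-1}{2} \ln(1 - Q_1) + \frac{1}{2} \ln\big(1 - (1-q) Q_1\big) - \gamma m' R + \gamma \ln\Big( e^{R} \int Ds\, H_+^q + e^{-R} \int Ds\, H_-^q \Big) \Big] \tag{13} $$

where $Ds = \frac{ds}{\sqrt{2\pi}} \exp(-s^2/2)$ is the Gaussian integral measure, $H(x) = \int_x^\infty Ds$, and
$H_\pm = H\big(\mp(\sqrt{Q_1}\, s + mM)/\sqrt{1 - Q_1}\big)$.
Extremising $\tau(q)$ with respect to $Q_1$, $M$, and $R$ yields the saddle point equations

$$ \frac{Q_1}{1 - (1-q) Q_1} = \gamma\, \frac{\int Ds\, (e^{R} H_+^{q-2} + e^{-R} H_-^{q-2})\, G^2}{\int Ds\, (e^{R} H_+^{q} + e^{-R} H_-^{q})} , $$

$$ 0 = \int Ds\, (e^{R} H_+^{q-1} - e^{-R} H_-^{q-1})\, G , $$

$$ m' = \frac{\int Ds\, (e^{R} H_+^{q} - e^{-R} H_-^{q})}{\int Ds\, (e^{R} H_+^{q} + e^{-R} H_-^{q})} , \tag{14} $$

where $G = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{(\sqrt{Q_1}\, s + mM)^2}{2(1 - Q_1)}\Big)$.

The cell spectrum $f(\alpha)$ can now be evaluated using (8).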
III. DISCUSSION

For $m' = 0$ the cell spectrum of unbiased patterns is recovered for any $m$. For $m = 0$ no coupling vector with a weak
magnetisation which produces a sequence of outputs with $m' \neq 0$ exists. We thus turn to the case $m' \neq 0$, $m \neq 0$.
A transformation $mM \to M$ in the free energy (13) would remove the magnetisation of the inputs from the saddle
point equations (14). Thus the properties of the cell spectrum for non-zero $m$ and $m'$ do not depend on $m$ but only
on $\gamma$ and $m'$, whereas the weak magnetisation of the couplings $M$ scales with $1/m$ for fixed $m'$ and $\gamma$. Hence we may
restrict the discussion to the case $m = m' \neq 0$ without loss of generality.




FIG. 1. The multifractal spectrum $f(\alpha)$ for various values of the loading parameter $\gamma = 0.2$ (dotted), 0.5 (long dashed),
1 (full), and 2 (dashed) and values of the magnetisation $m' = 0, 0.5, 0.75$ from top to bottom respectively. The parts with
negative slope have been omitted since their interpretation is presently not clear (cf. [11]).

Figure 1 shows the cell spectrum at several loading parameters $\gamma$ and magnetisations $m'$. For any given $q = df/d\alpha$
the number of cells decreases exponentially with increasing $m'$. This is in accordance with the fact that the maximal
possible number of cells with output bias $m'$ scales as $2^{N f_{tot}}$ with
$f_{tot} = -\gamma \big[ \frac{1-m'}{2} \log_2 \frac{1-m'}{2} + \frac{1+m'}{2} \log_2 \frac{1+m'}{2} \big]$.
The maximum $f_{max}$ of $f(\alpha)$ exponentially dominates the number of cells $\mathcal{N} = \int d\alpha\, 2^{N f(\alpha)}$, hence a randomly
chosen output sequence will result in a cell of size $\alpha(q = 0)$, which is termed a storage cell. For values of the loading
parameter below the critical storage capacity $\gamma_c$ typically all possible cells may be realised, so $f_{max} = f_{tot}$. However
above the critical storage capacity only an exponentially small fraction of all possible cells may be realised, hence
$f_{max} < f_{tot}$. Note that although $f_{max}$ decreases with increasing $m'$ as shown in figure 1, it does so more slowly than $f_{tot}$,
so that the storage capacity increases. This can also be seen by comparing the curves of $f(\alpha)$ at $\gamma = 2$ for $m' = 0$
and $m' = 0.75$. For $m' = 0$ the maximum of $f(\alpha)$ is attained as $\alpha \to \infty$, indicating that the cell volume shrinks to zero,
which signals the critical storage capacity. However for $m' = 0.75$ the maximum of $f(\alpha)$ is reached at finite values of
$\alpha$, indicating a finite size of a storage cell at $\gamma = 2$. In fact the limit $q \to 0$ of (14) can be taken explicitly, yielding the
saddle point equations for the storage problem for magnetised patterns.
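The exponent $f_{tot}$ is the binary entropy of the output bias multiplied by $\gamma$, and it can be compared with an exact count of output sequences at finite size. The sketch below is an illustration with hypothetical helper names `f_tot` and `exact_exponent`; the finite-size count uses the nearest admissible number of positive outputs, which is an approximation introduced here.

```python
import math


def f_tot(gamma, m_out):
    """gamma times the binary entropy (in bits) of outputs with mean m_out."""
    total = 0.0
    for x in ((1.0 - m_out) / 2.0, (1.0 + m_out) / 2.0):
        if x > 0.0:
            total -= x * math.log2(x)
    return gamma * total


def exact_exponent(n, gamma, m_out):
    """(1/N) log2 of the number of output strings with magnetisation m_out."""
    p = round(gamma * n)                # number of patterns
    k = round(p * (1.0 + m_out) / 2.0)  # number of outputs equal to +1
    return math.log2(math.comb(p, k)) / n


unbiased = f_tot(1.0, 0.0)
biased = f_tot(1.0, 0.5)
finite_size = exact_exponent(2000, 1.0, 0.5)
```

The difference between `finite_size` and `biased` is a correction of order $\log_2(N)/N$, which vanishes in the thermodynamic limit.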

On the other hand the cells dominating the volume $V = \int d\alpha\, 2^{N(f(\alpha) - \alpha)}$ of the coupling space are characterised by
$q = df/d\alpha = 1$. In general such a cell is taken to describe the generalisation behaviour of the perceptron, since in the
thermodynamic limit a randomly chosen teacher perceptron will lie within a cell of this size with probability one. The
saddle point equations (14) at this point are

$$ Q_1 = \gamma \int Ds\, (H_+^{-1} + H_-^{-1})\, G^2 \tag{15} $$

$$ R = 0 \tag{16} $$

$$ m' = 1 - 2H(mM) \tag{17} $$
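Equations (15)-(17) can be solved numerically with elementary tools. The following sketch rests on stated assumptions: it takes $H_\pm = H(\mp u)$ and $G = e^{-u^2/2}/\sqrt{2\pi}$ with $u = (\sqrt{Q_1}\, s + mM)/\sqrt{1 - Q_1}$, integrates with a plain trapezoidal rule, and finds $Q_1$ by bisection. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def H(x):
    """H(x) = int_x^infinity Ds for the standard Gaussian measure Ds."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def weak_field(m_out):
    """Solve (17), m' = 1 - 2 H(mM), for the product mM."""
    return NormalDist().inv_cdf((1.0 + m_out) / 2.0)


def rhs_q1(q1, gamma, m_field, s_max=8.0, steps=2000):
    """Right-hand side of (15), integrated with the trapezoidal rule."""
    h = 2.0 * s_max / steps
    total = 0.0
    for k in range(steps + 1):
        s = -s_max + k * h
        u = (math.sqrt(q1) * s + m_field) / math.sqrt(1.0 - q1)
        g2 = math.exp(-u * u) / (2.0 * math.pi)  # G squared
        # Skip terms where H underflows; G^2 vanishes even faster there.
        inv_h = sum(1.0 / hv for hv in (H(u), H(-u)) if hv > 1e-300)
        weight = h * (0.5 if k in (0, steps) else 1.0)
        gauss = math.exp(-0.5 * s * s) / math.sqrt(2.0 * math.pi)
        total += weight * gauss * g2 * inv_h
    return gamma * total


def solve_q1(gamma, m_out, iterations=40):
    """Bisection for Q_1 = rhs_q1(Q_1) on the interval [0, 1)."""
    m_field = weak_field(m_out)
    lo, hi = 0.0, 1.0 - 1e-9
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if rhs_q1(mid, gamma, m_field) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


q_a = solve_q1(1.0, 0.0)
q_b = solve_q1(2.0, 0.0)
```

Bisection is used instead of direct iteration because the right-hand side of (15) can exceed 1 at small $Q_1$ for larger $\gamma$, where the square root $\sqrt{1 - Q_1}$ would no longer be defined.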



The interpretation of this saddle point may not be immediately obvious, since we have specified the magnetisation of
the outputs, but no properties of a teacher that will produce such a set of outputs on a given input pattern. However
equation (17) provides an explicit relation between $m$, $m'$, and $M$. It gives the weak magnetisation $M$ of the couplings
of a teacher or student required to produce a magnetisation $m'$ of the outputs given a set of inputs with magnetisation
$m$. In the context of a teacher with magnetisation $M$ acting on a set of inputs with magnetisation $m$, (17) simply
follows from the central limit theorem and $Q_1$ describes the overlap between teacher and student after the student
has learned to classify $\gamma N$ examples correctly. The subsequent generalisation error $\epsilon_g$ is given by

[The equation for $\epsilon_g$ is not recoverable from the source.]




FIG. 2. The teacher-student overlap $Q_1$ and the resulting generalisation error $\epsilon_g$ as a function of $\gamma$ for $m' = 0, 0.5, 0.75$ (from
top to bottom). The uppermost curve corresponds to the results of Györgyi and Tishby [12]. The full lines are analytical
curves whereas the symbols are the results of numerical simulations with $N = 200$ averaged over 200 patterns. The symbol size
corresponds to 5 times the statistical error.

The full lines in Fig. 2 show the overlap $Q_1$ between student and teacher and the corresponding generalisation error $\epsilon_g$
as a function of $\gamma$ after the student has learned $\gamma N$ examples for different magnetisations $m' = m$. The generalisation
error is found to decrease for increasing $m'$ at fixed $\gamma$. The effect is most pronounced at low values of $\gamma$; in particular
we find $\epsilon_g(\gamma = 0, m') = (1 - m'^2)/2$. In fact the weak magnetisation of the couplings $M$ is independent of the loading
parameter and already takes on its non-zero value (for $m \neq 0$) at $\gamma = 0$. Moreover, two independent output strings $\sigma_1^\mu$
and $\sigma_2^\mu$ with the same average $m'$ differ on average in a fraction $(1 - m'^2)/2$ of their bits only. Hence $\epsilon_g < 0.5$ even for zero
teacher-student overlap $Q_1$. Qualitatively this means that the student learns the correct bias of the outputs after a non-extensive
number of examples already. A plausibility argument underlining this effect is as follows: By the central limit theorem
the sum $\frac{1}{\sqrt{N}} \sum_i J_i \xi_i$ is a Gaussian variable with mean $mM$ and variance 1. Hence equation (17) needs to be fulfilled
if an output of the same sign as $m'$ is to be produced with probability $(1 + m')/2$. If the number of examples $p$ becomes
infinite in the thermodynamic limit the number of outputs of the same sign as $m'$ tends to $p$ times this probability.
This implies that it is sufficient for the number of examples to scale with $N^\delta$, $0 < \delta < 1$, so that the number of examples
is infinite in the thermodynamic limit, for $M$ to take on its saddle point value.
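The closed form of $\epsilon_g$ is not legible in this text, so the sketch below rests on an assumption: teacher and student fields are taken as Gaussians with common mean $mM$, unit variance and correlation $Q_1$, which gives $\epsilon_g = 2 \int Ds\, H(u) H(-u)$ with $u = (\sqrt{Q_1}\, s + mM)/\sqrt{1 - Q_1}$. This form reproduces the value $(1 - m'^2)/2$ at $Q_1 = 0$ quoted above and the unbiased result $\arccos(Q_1)/\pi$ at $m' = 0$. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def H(x):
    """H(x) = int_x^infinity Ds for the standard Gaussian measure Ds."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def generalisation_error(q1, m_out, s_max=8.0, steps=4000):
    """Assumed form eps_g = 2 int Ds H(u) H(-u), trapezoidal rule in s."""
    m_field = NormalDist().inv_cdf((1.0 + m_out) / 2.0)  # mM from (17)
    h = 2.0 * s_max / steps
    total = 0.0
    for k in range(steps + 1):
        s = -s_max + k * h
        u = (math.sqrt(q1) * s + m_field) / math.sqrt(1.0 - q1)
        weight = h * (0.5 if k in (0, steps) else 1.0)
        gauss = math.exp(-0.5 * s * s) / math.sqrt(2.0 * math.pi)
        total += weight * gauss * 2.0 * H(u) * H(-u)
    return total


eps_no_overlap = generalisation_error(0.0, 0.5)  # student knows only the bias
eps_unbiased = generalisation_error(0.5, 0.0)    # unbiased case at Q_1 = 0.5
```

Combined with a value of $Q_1$ obtained from (15), the same function gives a curve $\epsilon_g(\gamma)$ under the stated assumption.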
IV. SIMULATION ALGORITHM AND NUMERICAL RESULTS

Gibbs learning at $T = 0$ is a convenient tool for the analytical study of generalisation problems, since it characterizes
the typical performance of a compatible student. It is however not completely straightforward to implement in
numerical simulations. The necessary average over the compatible students cannot be performed directly because
the version space is only an exponentially small fraction of the high-dimensional coupling space of the perceptron. Several
methods to circumvent this problem have been suggested, including a random walk in the version space of the student
[13], a billiard in version space [14], or a variant of the Adatron algorithm, where in each realisation a few randomly
chosen patterns are learnt in addition to the examples [15].

Here we propose the randomized perceptron algorithm as a new method to effectively simulate Gibbs learning
and apply it to the specific problem of biased patterns. Starting with all couplings equal to zero, the randomized
perceptron algorithm runs over all examples, leaving the couplings unchanged if the pattern is classified correctly by
their present values. If a pattern is not classified correctly, the algorithm adds the standard Hebbian term as well as
a random vector with components chosen independently from a Gaussian distribution. Since for large $N$ the random
vectors are all orthogonal to each other, the coupling vector will end up in version space even though the amplitudes
of the Hebbian and the random terms are of comparable magnitude. This procedure slows down the convergence of
the perceptron algorithm but leads to more reliable results for the generalisation error. The standard deviation of
the noise term was $4.8\gamma$ (compared to the magnitude of the Hebbian term of $2\gamma$), but no strong dependence on the
standard deviation was observed on the interval $2\gamma$ to $10\gamma$.
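The update rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used for the simulations: the Hebbian term has unit amplitude, the noise level `noise_std` is a free parameter rather than the value quoted above, the inputs are Gaussian with mean `M_IN`, and the function name is hypothetical.

```python
import random


def randomized_perceptron(patterns, outputs, noise_std, rng, max_sweeps=10000):
    """Sweep over the examples until all of them are classified correctly."""
    n = len(patterns[0])
    couplings = [0.0] * n  # start with all couplings equal to zero
    for _ in range(max_sweeps):
        updated = False
        for xi, sigma in zip(patterns, outputs):
            field = sum(j * x for j, x in zip(couplings, xi))
            if sigma * field <= 0.0:
                # Misclassified: add the Hebbian term sigma * xi plus an
                # independent Gaussian random vector.
                couplings = [
                    j + sigma * x + rng.gauss(0.0, noise_std)
                    for j, x in zip(couplings, xi)
                ]
                updated = True
        if not updated:
            return couplings  # a full sweep without errors: in version space
    raise RuntimeError("no compatible coupling vector found")


rng = random.Random(1)
N, P, M_IN = 50, 20, 0.5  # illustrative sizes, smaller than in the simulations
teacher = [rng.gauss(0.0, 1.0) for _ in range(N)]
patterns = [[M_IN + rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(P)]
outputs = [
    1 if sum(t * x for t, x in zip(teacher, xi)) > 0 else -1
    for xi in patterns
]
student = randomized_perceptron(patterns, outputs, noise_std=1.0, rng=rng)
```

Different seeds of `rng` produce different compatible students for the same examples, which is what makes averaging over runs a proxy for the average over version space.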

The simulations whose results are shown in Fig. 2 were performed with a system size of $N = 200$. Sets of Gaussian
distributed inputs with magnetisation $m$ were generated and the components of the couplings of potential teacher
perceptrons were chosen from a Gaussian distribution with zero mean. In this way each teacher perceptron was given
a weak magnetisation. Teachers were generated until one of them produced an output magnetisation $m'$ on the given
inputs. This teacher was used to generate the outputs used in the subsequent step: The resulting patterns were taught
to the student using the randomized perceptron algorithm and its overlap with the teacher and the generalisation
error were evaluated.
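The teacher selection step can be sketched as rejection sampling. Since a finite pattern set cannot match $m'$ exactly, the acceptance tolerance `TOL` below is an assumption introduced for this illustration, as are the helper names and the reduced system size.

```python
import random


def classify(couplings, patterns):
    """Outputs sgn(J . xi) of a perceptron on a list of patterns."""
    return [
        1 if sum(j * x for j, x in zip(couplings, xi)) > 0 else -1
        for xi in patterns
    ]


def draw_teacher(patterns, m_out, tol, rng, max_tries=100000):
    """Draw zero-mean Gaussian teachers until the output bias is near m_out."""
    n = len(patterns[0])
    for _ in range(max_tries):
        teacher = [rng.gauss(0.0, 1.0) for _ in range(n)]
        outputs = classify(teacher, patterns)
        if abs(sum(outputs) / len(outputs) - m_out) <= tol:
            return teacher, outputs
    raise RuntimeError("no teacher with the requested output bias found")


rng = random.Random(2)
N, P = 50, 100
M_IN, M_OUT, TOL = 0.5, 0.5, 0.05
# Gaussian inputs with mean M_IN and unit variance.
patterns = [[M_IN + rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(P)]
teacher, outputs = draw_teacher(patterns, M_OUT, TOL, rng)
```

The accepted teacher carries a weak magnetisation of order 1 by chance, in line with the description above; the resulting `outputs` are the labels passed on to the student.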

Except at large values of the magnetisation m', where finite size effects are more noticeable, the numerical results 
are in very good agreement with the analytical curves. 

V. SUMMARY 

In the present paper we have investigated the influence of a bias in the distribution of inputs and outputs on the cell
structure in the phase space of a perceptron. To this end the multifractal spectrum $f(\alpha)$ was calculated analytically
for different values of the storage ratio $\gamma$ with the help of the methods used already for the case of unbiased patterns.
Both the storage and the generalization properties can be read off from the behaviour of $f(\alpha)$.

For the storage problem we showed that the input bias has no influence on the storage capacity provided the output
bias $m'$ is equal to zero. If both input and output bias are non-zero the storage capacity $\gamma_c$ increases with increasing
output bias $m'$ irrespective of the value of the input bias. Biased patterns have been considered before in the context
of the phase space volume of attractor neural networks. In this case it is natural to assume $m = m'$. Our results
show that for the perceptron this case is in fact generic, as for non-zero $m$ the properties of the entire cell spectrum
only depend on $m'$. The case $m = 0$ but $m' \neq 0$ cannot be realized by a perceptron with weak magnetisation of the
couplings without thresholds [16] and thus was not treated here. The behaviour of the maximum $f_{max}$ of $f(\alpha)$ as a
function of $m'$ generalizes the results about the number of dichotomies to $m' \neq 0$.

For the generalization problem we found an interesting two-level scenario of learning. The student first determines
the weak magnetization of the teacher couplings necessary to produce outputs of the required bias. This is accomplished
after a non-extensive number of training examples already. The curves for the generalization error therefore
start off at $\gamma = 0$ with values smaller than 0.5. In the second step the student then reduces the generalization error
further in the usual way. The asymptotic behaviour is not modified by the bias of the patterns. The analytical
results are in excellent agreement with numerical simulations. We have found a randomized variant of the perceptron
algorithm to be a simple device for reliable simulations of Gibbs learning.

Biased patterns can be viewed as the simplest example of a hierarchy of inputs. It would be interesting to see
whether the generalization strategy observed can also be found in the general case of hierarchically correlated patterns
in the sense that the student first learns the classes of patterns and only then the individual representatives.